Journal reference: Computer Networks and ISDN Systems, Volume 28, issues 7–11, p. 1513.
The present paper proposes that a database can often be organized as a large collection of small text files, each containing the structured information that relates to a particular object. This means that the basic idea of WWW pages - small text files that are cross-linked by symbolic addresses - is generalized to become a database technique as well. The database pages are expressed in HORL notation (HyperObject Representation Language). Just like a WWW browser accesses and reads HTML pages dynamically as it needs them, our WWDB database system accesses and reads HORL pages as it needs them in the course of its data processing operations. When an HORL page is read by the database system, its contents are converted to an internal representation, and it does not have to be read again during the same session.
The WWDB technique can easily be combined with WWW usage. It was developed as a tool for generating WWW pages with structured contents, for example annotated publication lists, directories of authors, journals, and conferences, etc. It has also been put to a second usage in a mail management system. Our experience suggests that it may be an attractive and viable technique for many kinds of low-to-medium duty database applications, in particular, those where the database is a source of information rather than e.g. the basis of a transaction processing system.
In this paper, we first describe the application within which the WWDB concepts were developed, and then proceed to a description of the essential technical characteristics of the existing, experimental implementation. Finally we discuss what conclusions can be drawn from the project so far, and the perspectives for continued use of this technique.
An electronic colloquium generalizes this concept to the electronic arena, and the WWW is an ideal substrate for it. Simply speaking, the colloquium home page offers a menu of specific services, in particular:
The important thing about a colloquium is that it should have very clear focus, and be oriented to a particular research topic. Thus, given that articles addressing the given topic may appear in any one of a large number of journals or conferences, but in each of those only a small percentage of the contents are actually relevant for the colloquium topic, the colloquium will be highly selective. Ideally, it will present all relevant contributions from those sources, and no irrelevant contributions. The colloquium members define the focus and perform the selection.
Electronic colloquia of this kind are particularly important on the European scene, where they offer a possibility for researchers in different countries to obtain continuous interaction with a group of sufficient critical size.
The Compulog project of Esprit (European Union research program in information technology) has recently started an electronic colloquium for spatial and temporal reasoning (ECSTER). This is a sub-area of research in knowledge representation and artificial intelligence, dealing with logical and algorithmic methods for reasoning about actions and their effects, developments over time, etc. Planning, scheduling, and diagnosis based on temporal and spatio-temporal data are some of its application areas. It is an example of a specialized research topic, containing work that ranges from the highly theoretical to the quite practical, where an electronic colloquium would be of interest. The present number of active researchers in Europe in this area is estimated to be around one hundred, most of them working isolated or in small local groups.
For obvious reasons, we chose to use the WWW as the information carrier for ECSTER. At first, experimental versions of the important colloquium pages were set up completely manually, using a text editor, but it soon became apparent that this was inefficient and inconvenient. The problem consisted not only in having to write HTML syntax, but also in the redundancy of the actual information contents: the same data tended to appear repeatedly in multiple contexts. Furthermore, in order to keep reasonable order in the accumulated information, it was necessary to organize it in a multi-level directory structure under the operating system being used (Unix), but the chore of locating files at different directory levels was a nuisance in itself. Finally, we wished to have a clear separation between the working version and the public version of each HTML file, so that the maintainer of a page or substructure could work on it until satisfied, and only then release it for public viewing.
In summary, the practical overhead concerned both the structure within the WWW pages and between them. Furthermore, the same problems arose with other information that we were dealing with, besides the WWW pages. Distribution of published papers is an important function of an electronic colloquium, and the various aspects of a paper (full text, abstract, commentary, annex containing experimental data or software and its documentation, etc.) impose administrative burdens that are fairly analogous to those that arise for the WWW pages in HTML format.
It became clear very soon, therefore, that we needed to introduce a structured representation for these kinds of information. This structured representation should be a database in the sense of having little or no redundancy, so that each essential information element is only represented once, and it should lend itself easily to being processed in operations that combine related information elements. The HTML representation should then be generated from that database, or, to be precise, large sections of the HTML files should be generated from underlying structured data. There must always be some parts which only serve presentation purposes, and which continue to be best written in HTML. For example, an HTML page may have the following essential structure:
We believe that this situation is typical of many "bread and butter" applications of WWW. Naturally, home pages and other pages that a user encounters more or less immediately must be more interesting and less standardized, but as regards those pages that serve a productive purpose, it seems that the information one wants to present is often structured, and can best be generated from an underlying representation. The availability of a richer presentation language, with audio, animation, color, and embedded video capabilities does not change the essential situation: if anything, it will increase the need for a structured representation of the information. We will return to this topic in the final section of the paper.
If HTML pages are generated from underlying data, there is a choice whether the generation is to be done in advance, under the direction of the person editing the data, or on demand as the user accesses the information. The difference is a practical one, since the generation process is quite similar in both cases. In our application there has not yet been any strong reason for on-line generation of HTML, so we have chosen the former alternative so far. The methods proposed below would work equally well in the case of on-line generation, however.
For the representation of the structured data, the most obvious choice might have been to use a conventional database system. However, we chose instead to organize our database as large numbers of small text files, which are expressed in the HORL syntax. The resulting database is still of moderate size, but it has the inherent capability of growth that is suggested by its name, a World-Wide Data Base. We now proceed to describe this design and the reasons why it was chosen.
The WWDB is an object-oriented database in the literal sense that it is organized as a collection of objects each of which has a number of properties. Objects are classified into types; objects of the same type have similar sets of properties. Other notions that are often associated with the term "object-oriented", such as message-passing and inheritance, are not presently used in the WWDB. In database terminology, the WWDB may be described as a binary database.
Each object has a name and a description. The object's name is like an identifier in an ordinary programming language. The description is an expression that maps labels to properties. For example, the combination of the name |France| and the type |countries| may be assigned the following description:
{ CAPITAL ~ |Paris|,
  CURRENCY ~ FRF,
  NEIGHBORS ~ { |Belgium|, |Germany|, |Switzerland|, |Italy|, |Spain|, |Andorra| }}

where the tilde character is to be read as an arrow, connecting a label and the corresponding property. The description is the real "object"; several objects may have the same name, but for each combination of a name and a type there may be at most one description. (For example, if persons are denoted by their last name, then the combination of |France| and |persons| may represent Anatole France). Properties may be names or sets of things, but also numbers, strings, sequences of things, new mappings, etc.
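To make the notation concrete, a reader for the subset of HORL shown above can be sketched in a few dozen lines. The following Python fragment is purely illustrative: the tokenizer, the tagged-tuple representation of names and symbols, and the set-versus-mapping test are our assumptions, not part of the HORL definition.

```python
import re

# Tokens: |names|, "strings", the punctuation { } ~ , and bare symbols.
TOKEN = re.compile(r'\|[^|]*\||"[^"]*"|[{}~,]|[^\s{}~,]+')

def tokenize(text):
    return TOKEN.findall(text)

def parse(tokens):
    """Parse one HORL value, consuming tokens from the left."""
    tok = tokens.pop(0)
    if tok == '{':
        return parse_braced(tokens)
    if tok.startswith('|'):          # object name, e.g. |Paris|
        return ('name', tok[1:-1])
    if tok.startswith('"'):          # string literal
        return tok[1:-1]
    return ('symbol', tok)           # bare symbol, e.g. FRF

def parse_braced(tokens):
    """A braced form is a mapping if '~' follows its first element, else a set."""
    items, mapping, is_map = [], {}, None
    while tokens[0] != '}':
        first = parse(tokens)
        if is_map is None:
            is_map = tokens[0] == '~'
        if is_map:
            tokens.pop(0)            # consume '~'
            mapping[first] = parse(tokens)
        else:
            items.append(first)
        if tokens[0] == ',':
            tokens.pop(0)
    tokens.pop(0)                    # consume '}'
    return mapping if is_map else items
```

A braced form is read as a mapping exactly when a tilde follows its first element; otherwise it is a set, represented here as a list to preserve order.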
So far, this is quite conventional, and it should be clear how one can build a database with authors, publications, universities, cities, countries, journals, conferences, and so on as some of the types. Rather than storing these objects and object descriptions in an ordinary database system, we chose to create one file for each object, and to store the description in textual form in that file. Instead of a database system, we now have a database browser, that is, a program that reads database text files as it needs them.
One of the uses of the database browser is for interactive updating of the database: adding more information, or correcting its existing data contents. Typically, this usage is interleaved with generation of HTML pages. A number of other tasks are also evident, such as for database search, and for consistency controls, but so far the HTML generation task has dominated in our applications.
The full ECSTER structure consists of a number of such pages, which are linked in an approximate tree structure. The public versions and the working versions are linked as parallel structures, so that a public page links to a public subpage, and a working page to the corresponding working sub-page. Only on exit from the structure, for example in references to the full text of an archived article, or the reference to the home page of a researcher, do the parallel structures converge to common points.
Extended versions do not form a third parallel structure. Instead, an extended-version page links to working versions of subordinate or neighboring pages.
The on-line reader is invited to visit ECSTER's public home page and the working home page, as well as their respective sub-pages, in order to see how this works. The clickable item "[revision]" locally has the effect of invoking the WWDB system; remote users will only see the Lisp code but not its execution.
As discussed above, each page typically contains some parts that are to be edited directly on HTML level, and some parts that are to be generated from the database. The direct-editing parts are modified in the usual fashion, for example using the editing capability of the HTML browser, or using a plain text editor. (We are currently using Emacs for this purpose). The automatically generated parts are distinguished by the separate quasi-HTML command label, so the text of an HTML page may have the following structure:
<label heading>
Automatically generated heading
</label heading>
Text pertaining to manually written and edited parts...
<label contents>
Automatically generated part
</label contents>
More text pertaining to manually written and edited parts...
<label footing>
Automatically generated footing
</label footing>

Of course the label and /label commands are ignored by the HTML browsers; for the maintainer of the page they indicate that whatever goes between <label x> and </label x> shall be left alone, since it will be regenerated anyway.
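The mechanics of regeneration are simple: everything between a matched pair of label commands is replaced wholesale, and the surrounding manually edited text is left untouched. A minimal sketch, in Python rather than the CommonLisp of the actual system:

```python
import re

def regenerate_segment(page_text, label, generated):
    """Replace whatever stands between <label x> and </label x> with freshly
    generated content; all manually edited text outside the markers is kept."""
    pattern = re.compile(r'<label %s>.*?</label %s>' % (label, label), re.DOTALL)
    replacement = '<label %s>\n%s\n</label %s>' % (label, generated, label)
    # A function replacement avoids backslash-escape surprises in the new text.
    return pattern.sub(lambda _: replacement, page_text)
```

The point of the marker pair is exactly this: the regenerator never needs to understand the generated HTML, only to find the two markers.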
In order to change the autogenerated information, for example for adding one more author, or updating the information about a particular conference, the maintainer clicks the [revision] link at the top of the working page. This invokes or resets the WWDB browser, which is put in a state where the database object corresponding to the current HTML page is the current object, and |webpages| is the current type. The maintainer may then use the database browser to update the datastructures, and finally invoke commands that regenerate the current HTML page.
Concretely, suppose the current HTML page has the filename
/info/www/ext/brs/researchers/index-wv.html

and that it contains three autogenerated segments as described above. The corresponding database object is stored as the file

/info/www/ext/brs/researchers/index-wv.horl

with contents which may look as follows (in simplified form):
{ META ~ { ACCESSPATH ~ "/info/www/ext/brs/researchers",
           OBJNAME ~ |author-index|,
           FILENAME ~ |index-wv|,
           PUBLNAME ~ |index|,
           EXTENDNAME ~ |index-xv| },
  TITLE ~ "Catalogue of authors in the ECSTER area",
  FORMAT ~ |ecster-page|,
  LANGUAGE ~ |english|,
  GENERATORS ~ { |heading| ~ (GENERATE |ecster-heading|),
                 |contents| ~ (ALLMEMBERS |authorlist| |author-display|),
                 |footing| ~ (GENERATE |ecster-footing|) }}

The name of this database object is author-index; it could not be just index or index-wv, since many different HTML pages are called index. This object description contains enough information to reconstruct the full file names of the working version, the public version, and the extended version of the HTML page, since both the access path and the file names are there. It also contains relevant parameters, such as the language in which the page is written, which is needed in order to write language-independent generators. Finally, it specifies the generator methods for regenerating the three auto-generated segments.
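For illustration, reconstructing the three file names from the META properties might look as follows in Python (the key names follow the example above; the flat dictionary representation and the returned layout are our own):

```python
def page_filenames(meta):
    """Reconstruct the working, public, and extended HTML file names of a
    webpage object from its META properties."""
    prefix = meta['ACCESSPATH'] + '/'
    return {
        'working':  prefix + meta['FILENAME'] + '.html',
        'public':   prefix + meta['PUBLNAME'] + '.html',
        'extended': prefix + meta['EXTENDNAME'] + '.html',
    }

# The META properties of the example page, as a plain dictionary:
meta = {'ACCESSPATH': '/info/www/ext/brs/researchers',
        'FILENAME': 'index-wv', 'PUBLNAME': 'index', 'EXTENDNAME': 'index-xv'}
```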
To use those methods, the maintainer uses the interactive commands rg and rgp. Writing
rg contents

to the WWDB browser (or selecting the same command from pull-down menus; not implemented at present) when |author-index| is the current object will cause the current HTML working page to be regenerated, retaining all lines except the text between the lines <label contents> and </label contents>. The expression (ALLMEMBERS |authorlist| |author-display|) in the object description specifies the recipe for this generation process: the object |authorlist| contains a list (ordered set) of authors, which is here used as the basis for generation, and |author-display| is a script specifying how to generate the appropriate HTML expressions for a given object of type author. (Naturally, |authorlist| may be used in several different contexts). The operation ALLMEMBERS looks up the list of members represented by its first argument, and generates HTML code for each of them in succession using the script given as its second argument.
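In Python terms, with in-memory dictionaries standing in for the HORL files and with invented author data, the effect of ALLMEMBERS can be sketched as:

```python
def allmembers(lookup, list_name, display):
    """ALLMEMBERS sketch: look up the ordered set of member names, then emit
    HTML for each member in succession using the display script."""
    return '\n'.join(display(lookup(name)) for name in lookup(list_name))

# In-memory stand-ins for object descriptions (invented data):
objects = {
    'authorlist': ['Dupont', 'Schmidt'],
    'Dupont':  {'NAME': 'J. Dupont',  'AFFILIATION': 'U. Paris'},
    'Schmidt': {'NAME': 'K. Schmidt', 'AFFILIATION': 'TU-Munich'},
}

def author_display(desc):            # stand-in for the |author-display| script
    return '<li>%s (%s)</li>' % (desc['NAME'], desc['AFFILIATION'])

html = allmembers(objects.get, 'authorlist', author_display)
```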
As the WWDB browser is invoked from the working page, only the program and a kernel set of objects are loaded. Additional object descriptions are loaded from their text files as they are needed. For example, the first time the command rg contents is given in the example, it will cause the WWDB browser to load the description of the object |authorlist| from its file, and then in turn it will load the descriptions of all the authors that are in the author-list. The presentation of these authors may in turn require the loading of additional objects. For example, the current affiliation of an author may be represented by specifying a WWDB name, for example |TU-Munich|, and then the corresponding description has to be loaded in its turn. An encouraging observation from the present experimental implementation is that this loading of successive objects can be performed quite rapidly, and does not offer any practical performance problems.
Loaded objects are retained in working memory, so the next time the same rg command is issued they do not have to be re-loaded.
The operation GENERATE, by contrast, is a simpler operation which generates HTML expressions according to the formatting directive of its single argument, in the presence of the current object. For example the |ecster-heading| format will use the HEADING property of the current object for both the HTML TITLE lines and the first-level headings.
Thus the routine of the web-page maintainer is to modify the database, using the viewing and editing commands of the WWDB browser, and to regenerate the HTML working page from time to time using the rg command. Since the WWW-HTML browser and the WWDB browser appear in separate windows on the workstation, it is easy to reload the regenerated HTML page, look at it, and return to the WWDB browser as necessary.
Finally, when the new HTML page or set of pages is satisfactory, one regenerates the public page(s) as well, using the command rgp (for "regenerate public"), as in

rgp contents
What has now been described is the basic organization of the system. Additional facilities can easily be introduced into the same architecture. For example, if a given update or set of updates of the database affects a number of HTML pages, it would be desirable to keep track of those dependencies, regenerate all affected pages, and inform the user of which pages have been changed. The existence of a database with flexible datastructures is a suitable basis for implementing such services.
For another example, if the manually edited (non-automatic) parts of a working page have been changed, and are to be transferred to public status, then all links to working versions of subpages or other related pages must be replaced by links to public versions. This requires a systematic scan of the entire text contents, either removing all substrings of the form -wv, or (more reliably) removing such substrings only where they appear in the appropriate context, and otherwise giving a warning message.
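A sketch of such a scan in Python; the context test used here (a -wv that ends an HTML file name) is our assumption about what counts as "appropriate context":

```python
import re

def release_links(page_text):
    """Working-to-public link rewrite: remove -wv where it ends an HTML file
    name, and report any other occurrence for manual inspection."""
    fixed = re.sub(r'-wv(?=\.htm)', '', page_text)
    warnings = ['line %d: unexpected -wv left in place' % n
                for n, line in enumerate(fixed.splitlines(), 1)
                if '-wv' in line]
    return fixed, warnings
```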
The reason for choosing CommonLisp for the experimental system was that the operations of printing and reading datastructures are built into the language, so that the transfer between the text-file representation and the in-memory representation of the object descriptions is trivial. An additional reason was that CommonLisp datastructures lend themselves easily to the implementation of embedded sublanguages, such as the script language for defining HTML generators.
The present implementation is experimental, and has been written without any particular consideration of efficiency. In spite of this, it operates with quite adequate speed. The loading time for the WWDB system is 9 seconds on a Sparcstation 10, provided that the LAN does not slow it down. (Typically it is only loaded once a day, and then used repeatedly throughout the day). The time for regenerating the list of authors, with its current 170 members, in the
The technique described here can be implemented quite compactly. The following figures for the size of the present program show that it is easy to implement and re-implement a WWDB browser. The figures refer to lines of Lisp S-expressions, spaciously printed:
The main limitation of CommonLisp and Xlisp at present is the limited access to screen dialogue capability. An obvious alternative would be to use Java, which would remedy that limitation. On the other hand, it would require a separate implementation of a package for printing and reading datastructures. The same requirement arises when the work is redone in e.g. C++.
For convenience in the development stage, we are using standard Lisp I/O of data structures (that is, Lisp's read and print functions) in parallel with the HORL representation.
The program is freely available. After some additional polishing, we intend to make the program available via ftp and the documentation via WWW.
The additional step that has been taken by WWDB is to make a much more complete separation of content and appearance, and to organize content as a database, while at the same time retaining the distributed text-file organization of the WWW. Appearance has not been an issue for the present project: we are satisfied with the appearance capabilities offered by HTML for the time being, which is why generating HTML pages is sufficient for us.
Java, which is generally viewed as the next step of development after HTML, is a programming language, no more and no less. Improved appearance capabilities are its particular strength, so in this respect it represents a development orthogonal to the one shown by WWDB. It follows that WWDB and Java together would most likely be a very powerful combination.
One must recognize, however, that an absolute separation of content and appearance is not possible. It must be understood as a guiding principle, and not as a strict rule.
Description-file retrieval within one file system. Briefly, the details are as follows. For every combination of an object name and a type name, the WWDB browser must be able to retrieve the full name of the file containing the object description, so that it can then load the contents of that file. The full name is constructed as (access path) + (file name) + (extension), where the extension is standardized as .horl for the hyperobject representation language used above, and .lsp if the same information is expressed in classical CommonLisp format. The retrieval process assumes that the description of the type is already available; if it is not then retrieval is called recursively with the previous type as the new object, and with |types| as the new type. (Types are a special kind of objects, of course). Then, two main cases are allowed:
(1) The same access path for all objects of the same type. In this case, the type description contains the access path for the members of the type, and the object name serves as file name. The construction of the full file name for the object is trivial.
(2) Each object in the type has its own access path. In this case, the access path is a property of the object, but not of the type. In fact, it is stored as a subproperty under the property META; an example of this was shown above for the case of an object of type |webpages|. The problem, of course, is that as long as the object description is still only stored as a text file, the browser does not know how to find it.
For this reason, WWDB contains the notion of concierges. A concierge, who is a key person particularly in Paris, is someone to whom you mention a name, and he or she will tell you where to go in order to find the person with that name. Similarly, a WWDB concierge is a WWDB object containing a mapping from names to access paths.
Therefore, the retrieval process which is given an object name, a type name, and the description for that type, will first check with the type description whether this type has a single common access path, or individual paths. In the former case, the type contains the access path and the object name becomes file name. In the latter case, the browser will go through all the currently loaded concierge objects and ask each of them whether they have an appropriate access path for the present combination of object name and type name, until it finds one that can provide the information.
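The two retrieval cases can be sketched as follows in Python (the dictionary layouts are assumptions; for simplicity the object name serves as file name in both cases, and the standard .horl extension is used):

```python
def description_file(name, type_desc, concierges):
    """Retrieval sketch. Case (1): the type carries a common ACCESSPATH and
    the object name serves as file name. Case (2): ask each currently loaded
    concierge in turn for an access path for this name."""
    if 'ACCESSPATH' in type_desc:                       # case (1): common path
        return type_desc['ACCESSPATH'] + '/' + name + '.horl'
    for concierge in concierges:                        # case (2): individual
        if name in concierge:
            return concierge[name] + '/' + name + '.horl'
    raise LookupError('no loaded concierge knows ' + name)
```

A concierge is here just a mapping from object names to access paths, in line with the informal description above.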
Types, in particular, have distributed access paths. The initially loaded WWDB system therefore only needs to load the description of the types |types| and |concierges|, a concierge for all types (that is, all members of the type |types|) that may need to be used initially, and concierge(s) for other relevant objects, for example for relevant members of |webpages|.
Description-file retrieval by world-wide access. The access mechanism with concierges which know about access paths has been generalized to allow arbitrary URLs, besides local paths. In this way, it is possible to construct a database that is similar to the vast body of displayable information already existing in HTML format. Individual contributions can be set up locally and made available on the Internet, and these contributions can be accessed and used by the database browsers of other users regardless of where they are. The power of this concept is that the usage of the information is not limited to viewing; it can also be processed, combined with information from elsewhere, and presented in very flexible ways.
The usage for electronic colloquia is important enough, but we foresee that the same technique can be used for much broader purposes. Imagine, for example, a world-wide database containing geographical and historical information: countries, cities, activities in those cities, historical events, and so on. It would be reasonable to start with fairly elementary facts, and then to extend the database by gradually attaching additional information to existing ones. A world-wide database with those kinds of contents could develop into an encyclopaedia that is available freely to everyone (in the same sense and to the same extent as the present WWW is free). More specialized knowledge bases in various academic disciplines might use the same technique.
Some additional constructs would be necessary as the world-wide database becomes larger and larger. The present design requires all participating partners to use the same naming scheme for types and for concierges. The distributed system may accommodate multiple descriptions for a given combination of object name and type name, for types with individual access paths, as long as each user only selects a subset of all available concierges. In this way the user only "sees" one of the descriptions for each object/type combination. But in a world-wide context, it may be necessary to accommodate different uses of the same type name or the same concierge name concurrently. One plausible way of doing that is to allow multiple domains, where each domain consists of a set of information providers, and the present naming scheme is used within each domain. For information exchange between domains, one would use the well-known technique of mediators, that is, devices that translate a query that has been issued in one domain as an object/type combination into a corresponding query in another domain.
The WWDB approach represents a deviation from this traditional mode of thinking. It allows data to be represented in small and simple text files whose contents are open to everyone. One is not dependent on the continued use of a particular database software; it is very easy to implement and re-implement support for the HORL format. This has been demonstrated by the moderate size of the present operational program, where the program kernel is a mere 12 pages of code.
Besides bringing independence from any particular software, the compactness of the WWDB design has another important effect: access to the world-wide database can easily be integrated with any user interface, be it a conventional WWW browser, a UIMS, a document preparation system, or a particular application program.
The TSIMMIS project at Stanford University [Garcia-Molina et al, 1995] advocates a tagged object model which is similar to our view of data. However, the notation used in TSIMMIS is at a quite low level compared to the set-theoretic notation used in our HORL. The TSIMMIS group does not report using access paths or URLs as first-class objects in their database.
The WWDB approach goes against current trends in the database area in another respect as well: large main-memory databases are presently a subject of considerable interest. Although this is important for many applications, one cannot hold a world-wide database in-core. The WWDB approach instead uses a browser-like database tool that loads HORL pages as it needs them.
What are the disadvantages of the WWDB approach? One of the major issues in traditional database technology is data consistency and integrity: a database system shall contain type declarations for the data it contains, and various control mechanisms for verifying the structural correctness and the consistency of those data. In a WWDB, we make a virtue of necessity and consider data consistency and integrity as a separate issue. Anyone who posts information as an information source in the WWDB will have to make his own commitments as to the structural properties of the data he provides. In some cases this may be a very small issue. In other cases it may not be sufficient, and then the WWDB approach is not appropriate for those cases.
Concurrent update is another, though related, topic. The WWDB approach is oriented towards the assumption that the data are object-oriented, and that different information providers make non-conflicting contributions and updates to the body of object descriptions for name/type pairs. The classical example from transaction data processing - making a withdrawal from one account and a corresponding deposit to another - would perform miserably in the WWDB architecture.
Actually, one observation from our Electronic Colloquium project has been that the traditional list of references in scientific articles is likely to become an obsolete construct in the age of electronic publication. Why should one freeze the reference list into the article; why not generalize it into a bibliographic reference structure which connects articles by binary links, and which can be gradually incremented over time, even after the article has been published?
The homepage of the WWDB project: [http://vir.liu.se/brs/database/]
The homepage of the ECSTER electronic colloquium: [http://vir.liu.se/brs/]
The author's homepage: [http://www.ida.liu.se/~erisa/]
Hector Garcia-Molina, Joachim Hammer, et al: Integrating and Accessing Heterogeneous Information Sources in TSIMMIS. Presented at the AAAI Symposium, 1995. Also available on-line in [postscript].
Guy L. Steele Jr.: Common LISP: The Language. Digital Press, 1984.